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1. Introduction 

Solving sparse linear systems of equations with many right-hand sides has important appli- 
cations in lattice QCD. One particular approach that has been found to play a key role as one 
approaches light quark masses is deflation of small eigenvalues(see [jl|]-||5|] for a description of 
different deflation techniques). Deflation techniques rely on computing eigenvectors with small 
eigenvalues and projecting over them to speed up the convergence for extra right-hand sides. The 
eigenvectors are computed either separately or concurrently while solving a subset of the right- 
hand sides. Here we consider seed methods [§]-[]9|] for Hermitian positive definite systems. The 
basic idea of Seed methods is to choose one right-hand side as a "seed system" and use the Krylov 
subspace developed for the seed system as a projection space for the remaining right-hand sides. 
At the end of solving the seed system, one has developed a better initial guess for the remaining 
systems. At that point, one has the choice of either repeating the seeding by choosing the second 
right-hand now as a seed system or solving the remaining systems using the Conjugate Gradient 
algorithm and the new initial guesses. Traditionally, the seeding is repeated and the efficiency of 
the method depends on how closely related the right-hand sides are. 

Since deflating small eigenvalues seems to be the dominant factor in speeding up the solution 
of subsequent right-hand sides, we propose three improvements to the standard seed method that 
helps achieve that. The main idea is to make sure that the Krylov subspace used in seeding contains 
good approximations of those important eigenvectors, even though we are not solving explicitly for 
them. First, we seed only once using the first right-hand side. Second, we run CG for the first right- 
hand side past the convergence limit. Third, since roundoff errors will affect the CG algorithm we 
have to apply some form of a reorthogonalization step. In this work we use periodic reorthogo- 
nalization. Because of the need for the reorthogonalization step, one has to save all the Krylov 
vectors during the solution of the first right-hand side. The extra memory and reorthogonalization 
cost has to be taken into consideration as will be discussed (see JT0| ] for a more detailed discussion 
of the method). In section || we describe the Improved Seed-CG algorithm. Results from testing 
the algorithm with a lattice QCD example are given in section ||[ 

2. Improved Seed-CG Algorithm 

Consider the linear systems Mx ] ' = V ; j = 1,2, ...,nrhs, where M is a symmetric (Hermi- 
tian) positive definite matrix. Let x J be the initial guesses and Tq = V — Mx J Q the initial residual 
vectors. In a seed approach, we'll solve the first right-hand side using the Lanczos (CG) algo- 
rithm to build a Krylov subspace = spanfr^Mr^M 2 ^, ..,M m ~ l rq} whose orthonormal bases 
is V = {vi , V2, .., v m } where m is the dimension of the Krylov subspace at which the first right-hand 
side has converged to the desired accuracy. The subspace developed for the first right-hand side is 
used to obtain a better initial guess for the remaining right-hand sides using a Glerkin projection 
such that x ; ;j = 2,3, ..,nrhs are the improved initial guesses. A faster convergence of the extra 
right-hand sides is obtained by using these improved initial guesses with a standard CG solver. If T 
is the tridiagonal matrix obtained in the Lanczos algorithm while solving the first right-hand side, 
then after the first right-hand side is solved we have: 

X j =x } + Vy i; y J = T-\V f r J ); j = 1,2, ..,nrhs. (2.1) 
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In order to update x J at every iteration we use LU factorization of the Lanczos tridiagonal ma- 
trix (in case of indefinite systems a QR or LQ factorization should be used). The LU factorization 
is given by (for m = 5 as an example): 



/ ai A 
A « 2 ft 
ft a 3 ft 
ft a 4 ft 
V ft a 5 J 



\ 



V 
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Using this factorization the small linear systems Ty-i = V*Tq are solved in an incremental way 
during every iteration. The procedure is similar to what is used with the D-Lanczos algorithm [jOj]. 

The vectors built with Lanczos three-term recurrence lose orthogonality as the dimension of 
the subspace becomes large enough for some of the eigenvectors to begin to converge [12], [JT3|] - 
One way to fix this problem is to perform an extra reorthogonalization step. Here, we implement 
a periodic reorthogonalization (PO) [R|,[13]. One chooses a frequency k of performing the PO 
step such that during iteration number i, if i is a multiple of k, we reorthonormalize v, against 
all vectors vi,v 2 , ...,v,-_i and reorthogonalize v;+i against all vectors vi, V2, v,- where Vj+i is the 



un-normalized i+ 1 vector []15fl,[]13fl. The Lanczos version of single seed CG with periodic re- 
orthogonalization is described below. 



Algorithm 1: Seed-CG(m,k) 

1. Start: Choose m, the maximum size of the subspace, and k, the frequency 

of reorthogonalization. The number of right-hand sides is nrhs. For j = 1,2, ..,nrhs, 
set the approximate solution denoted by x 7 equal to the initial guess (which maybe 
the zero vector) and compute the initial residual r } Q . 

2. First Lanczos iteration: A) = ll r oll' Vl = r o/A>; 
f = Mvu «i=vj/; f = f-a l v l ; ft = ||/||. 

3. Linear equations for first iteration: 81 = ai ; w\ = vi /8\ ; 

Ci = A>; x x =x l + £iwi. 

4. Other right-hand sides for first iteration: For j = 2, ..,nrhs, r\ ] ' = v\r ] Q ; 
x j =x j + ri j wi. Set/ = 2. 

5. Lanczos iteration:/ = Mvj — ft_iv;_i; a,- = v]f; f = f — (XiVi. 

6. Linear equations: 1 = j3,_ 1 / <5 ; -_ 1 ; 5, = a,- - 1 ft_ 1 ; 
w,- = (v,- - ft_iWi_i)/$; Ci = -7-iC-i; x 1 + Qw,-. 

7. Other right-hand sides: For j = 2,3, ..,nrhs , T] J ' = v]r ] Q — f t -\ 77 ; ; x y = x 7 + 77 

8. Reorthogonalization: If i is a multiple of fe, then reorthonormalize v, against 
vi , v% , . . . , v,_i and reorhtogonalize / against v 1 , v 2 , . . , v,- . 

9. Finish iteration and go to next iteration: ft = ||/||;v !+1 = //ft. 
if j < m, set j = i + 1 and go to step 5. 



The value of m is chosen in such a way that the first system is solved past the convergence 
limit that we require for the subsequent systems. This is to ensure that the Krylov subspace is large 
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enough to contain accurate eigenvectors with small eigenvalues. The reorthogonalization frequency 
is chosen to balance the cost of reorthognalization during the solution of the first right-hand side 
and the gain we get for the subsequent systems. In most cases we have many right-hand sides to 
solve so this reorthogonalization cost is not significant. 

3. A Lattice QCD Example 

The method is tested for Wilson fermions on a 20 3 x 32 quenched Wilson configuration at 
j8 = 6.0 . We choose K to be essentially Kcritical ~ 0.15720 to make the problem difficult. Since 
we seed only once, we need to consider only two sample right-hand sides. These were taken to 
be point sources with Dirac and color indices 1 , 1 and 1 , 2 respectively. We solve the even-odd 
preconditioned systems Ax J = (j> ] ; j = 1,2 with A non-Hermitian. We convert this problem into 
a Hermitian positive definite problem as A^Ax-i = A**ty J . However, the residual norm that will be 
monitored for convergence will always be the one that corresponds to the original non-Hermitian 
system.The relative residual norm at convergence is chosen to be 10~ 8 . Standard CG for this 
problem requires Krylov subspace of dimension about 2,500 vectors (5000 matrix- vector products). 
In order to reduce the matrix-vector products for the second right-hand side it will be necessary to 
solve the first right-hand side past convergence. In Figure |], results for the second right-hand side 
is shown when the first right-hand side is solved using Krylov subspaces with dimensions 2500, 
3000, 3500, and 4000. For this test, we used k = 100 for the reorthogonalization frequency. Finding 
the largest value of k that could be used for this problem was done by experimenting with different 
choices of k values. In Figure ^, we show results for the second right-hand side using a subspace 
dimension 3500 and k values 10, 100, 200, and 250. As can be seen from Figure ||, a k value 
less than 250 is suitable for this problem. Having a large k value reduces the cost associated with 
reorthogonalization. 

The results for the second and additional right-hand sides shows a considerable gain in terms 
of the number of matrix- vector products of CG when using the initial guesses from the seeding. For 
the first right-hand side, there is additional overhead as compared to standard CG that comes from 
three sources: the need to store all Krylov subspace vectors, reorthogonalization, and the need to 
solve past the convergence level. There is also the cost of projecting over for subsequent right-hand 
sides (seeding), however, in our example we have only one additional right-hand side and this cost 
won't be discussed. The memory cost can be tolerated given that it is only needed during the solu- 
tion of the first right-hand side and should be allocated and deallocated at run time. This pays off, 
particularly when we have many right-hand sides. In the above example, the results were obtained 
using the high-performance computer cluster at Baylor University which has 8 processors per node 
at 2.66GHZ and 16GB RAM per node. The run was done on 50 processors using 5 processors per 
node allowing for 3.2 GB RAM per process. It was possible to allocate up to 4000 Krylov vectors 
(every vector is distributed over the 50 processes) using this configuration without a need for mem- 
ory swapping that leads to a slower run. The second overhead comes from the reorthogonalization 
step. The value of the reorthognalization frequency has to be found by experimenting in order to 
find the optimal value that reduces the reorthogonalization cost and at the same time increases the 
gain for matrix-vector products of subsequent right-hand sides. Using the same computing config- 
uration as described above we made a simple timing test for the cost of reorthogonalization. The 
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Matrix- Vector Products 

Figure 1: Results for the second right-hand side with seed-CG(m,100) on a 20 3 x 32 quenched Wilson 
configuration at j3 =6.0 and K = 0. 15720. Standard CG results are shown for comparison. 




Matrix-Vector Products 

Figure 2: Effect of changing the frequency of reorthogonalization with m=3500 for the second right-hand 
side on the same configuration. Standard CG results are shown for comparison. 



5 



Seed methods for linear equations in lattice qcd 



Abdou Abdel-Rehim 



Frequency of reorthogonalization 


time in CPU seconds 


10 


1253 


100 


598 


200 


320 


250 


305 


no reorthogonalization 


227 



Table 1: Timings for solving the first right-hand side with different reorthogonalization choices. 



results are shown in Table []]. Since values of reorthogonalization frequencies 10, 100, and 200 
seem to give the same improvement for the second right-hand side as can be seen in Figure ||, we 
can choose k = 200 and in this case the extra cost of the reorthogonalization step is less than 2 
times the cost of no reorthognalization. The final overhead to consider is the need to solve the first 
right-hand side past convergence. We found that in terms of matrix-vector products, this leads at 
most to a factor of 2 also. In our example, as can be seen from Figure [I], solving the first right- 
hand side using 3000, 3500, and 4000 vectors gives similar gains for the second right-hand side 
using 1000, 2000, and 3000 extra matrix- vector products respectively (standard CG for the first 
right-hand side uses 5000 matrix-vector products). So, this leads to a factor of 20%, 40%, and 
60% extra matrix-vector products. So, the extra cost of solving the first right-hand side associated 
with the reorthogonalization and the need to go past convergence leads to about factor of 2 times 
the cost of standard CG. This extra cost is acceptable given the gain we get when we have many 
right-hand sides. 
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